There are a number of tools available in the Tools page of the 1000 Genomes Browser.
Our data is in standard formats such as SAM and VCF, which have tools associated with them. To manipulate SAM/BAM files, look at SAMtools for a C-based toolkit and links to APIs in other languages. To interact with VCF files, look at VCFtools, which is a set of Perl and C++ code.
We also provide a public MySQL instance with copies of the databases behind the 1000 Genomes Ensembl browsers. These databases are described on our public instance page.
There are no official torrents of the 1000 Genomes Project data sets.
We provide a public MySQL instance with copies of the databases behind the 1000 Genomes Project Ensembl browser. These databases are described on our public instance page. More information about the browsers and their history can be found on the browsers page.
The 1000 Genomes raw sequence data represents more than 30,000x coverage of the human genome and there are no tools currently available to search against the complete data set. You can, however, use the Ensembl or NCBI BLAST services and then use these results to find 1000 Genomes Project variants in dbSNP.
We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.
An example Perl command to run the script would be:
perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz \
    -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel \
    -region 13:32889611-32973805 -population GBR -population FIN
At the end of the 1000 Genomes Project, a large volume of the 1000 Genomes data (the majority of the FTP site) was available on the Amazon AWS cloud as a public data set.
The IGSR was established at the end of the 1000 Genomes Project, and the FTP site has since been developed further with additional data sets. The Amazon AWS cloud reflects the data as it was at the end of the 1000 Genomes Project and does not include any updates or new data.
You can find more information about how to use the data in the Amazon AWS cloud on our AWS help page.
You can subsample VCF files for a particular individual or list of individuals using either the Data Slicer or a combination of tabix and VCFtools.
The Data Slicer, described in more detail in the documentation, has options to filter by individual and by population. The individual filter takes the individual names from the VCF header and presents them as a list before giving you the final file. To filter by population, you must also provide a panel file which pairs individuals with populations; again, you are presented with a list to select from before being given the final file. Multiple elements can be selected in both lists.
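On the command line, the same list of individual names can be read from the VCF header before subsetting. The following is a minimal sketch, assuming an uncompressed VCF stream on stdin; the function names vcf_sample_names and vcf_has_sample are hypothetical helpers, not part of tabix or VCFtools.

```shell
# Print one sample ID per line from a VCF stream on stdin.
# Sample IDs are the columns after FORMAT (column 9) on the #CHROM header line.
vcf_sample_names() {
    awk -F'\t' '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) print $i
            exit
        }
    '
}

# Succeed only if the given sample ID appears in the VCF header on stdin.
vcf_has_sample() {
    awk -F'\t' -v id="$1" '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) {
                if ($i == id) found = 1
            }
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, piping the output of tabix -H (header only) for a release VCF into vcf_sample_names should list the individuals in that file, which lets you confirm that an ID such as HG00098 is present before running vcf-subset.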
To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
Our pilot data is all presented with respect to NCBI36 and our main project data is all presented with respect to GRCh37. If you need variant calls to be in a particular assembly it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers as this will provide a definitive mapping.
If an rs number or equivalent is not available, tools to map between NCBI36, GRCh37 and GRCh38 are available from both Ensembl and the NCBI.
The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.
Please look at the software’s website to find those files.
The 1000 Genomes Project has used several different alignment algorithms during its duration:
Project stage | Sequencing technology | Alignment algorithm |
---|---|---|
Pilot | Illumina | MAQ |
Pilot | SOLiD | Corona lite |
Pilot | 454 | ssaha |
Main | Illumina | BWA |
Main | SOLiD | BFAST |
Main | 454 | ssaha (first set) |
Main | 454 | smalt (final set) |
The full process is described in the README
Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi allelic variants, each alternative allele frequency is presented in a comma separated list.
An example VCF line whose INFO column contains this information looks like:
1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
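The global and super population frequencies can be pulled out of the INFO column with standard tools. This is a minimal sketch, assuming tab-delimited VCF lines on stdin (the example above is shown with spaces, but VCF files themselves are tab-delimited); vcf_info_afs is a hypothetical helper name.

```shell
# Print every AF-style key (AF, EAS_AF, AMR_AF, ...) and its value from the
# INFO field (column 8) of each VCF data line.
# Output columns: CHROM, POS, key, value.
vcf_info_afs() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^([A-Z]+_)?AF=/) {
                    split(kv[i], pair, "=")
                    print $1 "\t" $2 "\t" pair[1] "\t" pair[2]
                }
            }
        }
    '
}
```

For multi allelic variants the value column keeps the comma separated list as it appears in the file, one frequency per alternative allele.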
If you want population specific allele frequencies you have three options:
* For a single variant, you can look at the population genetics page for that variant in our browser. This gives you pie charts and a table for a single site.
* For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
* If you would like subpopulation allele frequencies for a whole file, it is best to use the vcftools command line tool.
This is done using a combination of two vcftools commands called vcf-subset and fill-an-ac.
An example command set using files from our phase 1 release would look like:
grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list
vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac |
bgzip -c > CEU.chr13.phase1.vcf.gz
Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).
Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.
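The AC/AN division can be done for every site in a file in one pass. The sketch below assumes an uncompressed, tab-delimited VCF on stdin whose INFO column carries AC and AN (as produced by fill-an-ac); vcf_ac_over_an is a hypothetical helper name.

```shell
# For each VCF data line on stdin, print CHROM, POS, ALT allele and AC/AN.
# Multi allelic sites carry comma-separated AC values, one per ALT allele.
vcf_ac_over_an() {
    awk -F'\t' '
        /^#/ { next }
        {
            ac = ""; an = ""
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^AC=/) ac = substr(kv[i], 4)
                if (kv[i] ~ /^AN=/) an = substr(kv[i], 4)
            }
            # Skip sites without usable counts.
            if (ac == "" || an == "" || an + 0 == 0) next
            m = split(ac, counts, ",")
            split($5, alts, ",")
            for (j = 1; j <= m; j++)
                printf "%s\t%s\t%s\t%.6f\n", $1, $2, alts[j], counts[j] / an
        }
    '
}
```

A compressed file can be fed in with something like gzip -dc CEU.chr13.phase1.vcf.gz | vcf_ac_over_an. For the example site shown earlier (AC=3050, AN=5008) this gives 0.609026, matching the quoted AF.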
There are two ways to get subsections of our BAM files.
The first is to use the Data Slicer tool from our browser which is documented here. This tool gives you a web interface requesting the URL of any BAM file and the genomic location you wish to get a sub-slice for. This tool also works for VCF files.
The second is to use samtools on the command line, e.g.
samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 17:7512445-7513455
Samtools supports streaming files and piping commands together using both local and remote files. You can get more help with samtools from the samtools help mailing list.
There are two ways to get a subset of a VCF file.
The first is to use the Data Slicer tool from our browser which is documented here. This tool gives you a web interface requesting the URL of any VCF file and the genomic location you wish to get a sub-slice for. This tool also works for BAM files. This tool also allows you to filter the file for particular individuals or populations if you also provide a panel file.
The second method is to use tabix on the command line, e.g.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768
Specifications for the VCF format, and a C++ and Perl tool set for VCF files, can be found at vcftools on SourceForge.
Please note that all our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than chr1 in the UCSC style. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
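If your regions come from a UCSC-style source, one way to avoid this is to strip the chr prefix before calling tabix. This is a minimal sketch; to_ensembl_region is a hypothetical helper name, and it only removes the prefix (it does not attempt to handle other naming differences, such as the mitochondrial chromosome).

```shell
# Convert a UCSC-style region (chr2:100-200, chrX) to the Ensembl style used
# in the 1000 Genomes VCF files (2:100-200, X).
# Input without a chr prefix is passed through unchanged.
to_ensembl_region() {
    printf '%s\n' "${1#chr}"
}
```

The result can then be used as the region argument, for example "$(to_ensembl_region chr2:39967768-39967768)" in place of chr2:39967768-39967768 in the command above.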
The International Genome Sample Resource (IGSR) has stopped mirroring sequence files from the ENA and instead uses the sequence.index files to point to the FTP location of each fastq file.
e.g. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz
These files can also be downloaded using Aspera. You will need to get the ascp program as described in how to download files using aspera.
Then you will need to change the ENA FTP host to the ENA Aspera host.
This means you need to change the FTP URL to something suitable for the ascp command:
e.g. ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz
becomes
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz
Your Aspera command would need to look like:
ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz ./
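The URL rewrite described above is mechanical, so it can be scripted when you have many paths from a sequence.index file. This is a minimal sketch; ena_ftp_to_aspera is a hypothetical helper name. The user name is an optional second argument, and its default of era-fasp is an assumption taken from the ENA ascp example elsewhere in this document.

```shell
# Rewrite an ENA FTP URL into the user@host:path form that ascp expects.
# $1: FTP URL on ftp.sra.ebi.ac.uk
# $2: Aspera user name (optional; the default is an assumption, see above)
ena_ftp_to_aspera() {
    url=$1
    user=${2:-era-fasp}
    case "$url" in
        ftp://ftp.sra.ebi.ac.uk/*)
            printf '%s@fasp.sra.ebi.ac.uk:/%s\n' "$user" "${url#ftp://ftp.sra.ebi.ac.uk/}"
            ;;
        *)
            echo "not an ENA FTP URL: $url" >&2
            return 1
            ;;
    esac
}
```

The printed value can be passed to ascp as the source argument. URLs on other hosts are rejected rather than rewritten.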
For further information, please contact info@1000genomes.org. For full documentation on using Aspera to download files from the ENA, please see their document Downloading sequence files.
Aspera provides a fast method of downloading data. To use the Aspera service you need to download the Aspera connect software. This provides a bulk download client called ascp.
For the command line tool ascp, for versions 3.3.3 and newer, you need to use a command line like:
ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./
For versions 3.3.2 and older, you need to use a command line like:
ascp -i bin/aspera/etc/asperaweb_id_dsa.putty -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./
Note, the only change between these commands is that for newer versions of ascp asperaweb_id_dsa.openssh replaces asperaweb_id_dsa.putty. This change is noted by Aspera here. You can check the version of ascp you have using:
ascp --version
The argument to -i may also be different depending on the location of the default key file. The command should not ask you for a password. All the IGSR data is accessible without a password but you do need to give ascp the ssh key to complete the command.
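The choice of key file from the version rule above can be sketched as a small helper. This assumes you pass in the version number yourself (for example 3.3.3) after reading it from the ascp --version output, since the exact layout of that output is not described here; ascp_key_name is a hypothetical helper name.

```shell
# Print the key file name for a given ascp version string:
# asperaweb_id_dsa.openssh for 3.3.3 and newer,
# asperaweb_id_dsa.putty for 3.3.2 and older.
ascp_key_name() {
    printf '%s\n' "$1" | awk -F. '
        {
            # Compare major.minor.patch numerically against 3.3.3.
            v = ($1 + 0) * 1000000 + ($2 + 0) * 1000 + ($3 + 0)
            if (v >= 3003003) print "asperaweb_id_dsa.openssh"
            else              print "asperaweb_id_dsa.putty"
        }
    '
}
```

The directory holding the key still depends on your installation, so the printed name needs to be joined to the right path before it is given to -i.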
Some of the data we provide URLs for is hosted on the ENA FTP site. ENA provide information on using Aspera with their FTP site.
As an example of downloading a file from ENA, you could use a command line like:
ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- \
    era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz ./
If you are unsure of the location of asperaweb_id_dsa.openssh or asperaweb_id_dsa.putty, Aspera provide some documentation on where these will be found on different systems.
For the above commands to work with your network’s firewall you need to open ports 22/tcp (outgoing) and 33001/udp (both incoming and outgoing) to the following EBI IPs:
If the firewall has UDP flood protection, it must be turned off for port 33001.
Our Aspera browser interface no longer works. If you wish to download files using a web interface, we recommend using the Globus interface we present. If you previously relied on the Aspera web interface and wish to discuss your options, please email us at info@1000genomes.org.
All our variant call releases since 20100804 have come with a panel file. This file lists all the individuals who are part of the release and the population they come from.
This is a tab delimited file which must have sample and population in its first two columns; some files may then have subsequent columns which describe additional information like which super population a sample comes from or what sequencing platforms have been used to generate sequence data for that sample.
The panel files have names like integrated_call_samples.20101123.ALL.panel or integrated_call_samples_v2.20130502.ALL.panel.
These panel files can be used by our browser tools (the Data Slicer, Variant Pattern Finder and VCF to PED converter) to establish population groups for filtering.
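Because the sample and population are always the first two tab-delimited columns, a panel file can also be filtered directly on the command line. This is a minimal sketch; samples_in_population is a hypothetical helper name.

```shell
# Print the sample IDs (column 1) whose population (column 2) matches exactly.
# Usage: samples_in_population POPULATION PANEL_FILE
samples_in_population() {
    awk -F'\t' -v pop="$1" '$2 == pop { print $1 }' "$2"
}
```

For example, samples_in_population CEU integrated_call_samples.20101123.ALL.panel > CEU.samples.list produces a sample list suitable for vcf-subset. Matching on the second column only avoids picking up lines where the population code happens to appear in a later column.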
The 1000 Genomes data is available via ftp, http and Aspera. Any standard tool like wget or ftp should be able to download from our ftp or http mounted sites. To use Aspera you need to download their client.
The VCF files on our site cover a wide variety of different versions but our most recent release VCF files are in format version 4.1
In some early main project releases the allele frequency (AF) was estimated using additional information like LD, mapping quality and Haplotype information. This means in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release AF should always match AC/AN rounded to 2 decimal places.
There are two main reasons a tabix fetch might fail.
All our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than chr1 in the UCSC style. If you request a subsection of a VCF file using a chromosome name in the style chrN, as shown below, it will not work.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
Also, tabix does not report an error when a remote stream is interrupted; it simply stops streaming. When you stream large sections of a file, this can leave an incomplete final line with an unexpected number of columns. The only way to avoid this is to download the file and work with it locally.
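If you do stream a region, a simple column count check can at least detect a truncated result. This is a minimal sketch, assuming the output was fetched with tabix -h so that the #CHROM header line is present; vcf_check_columns is a hypothetical helper name. It will not notice a stream that happened to stop exactly at the end of a line.

```shell
# Read an uncompressed VCF on stdin and report data lines whose number of
# tab-separated fields differs from the #CHROM header line.
# Exit status is non-zero if any such line is found.
vcf_check_columns() {
    awk -F'\t' '
        /^##/     { next }
        /^#CHROM/ { expected = NF; next }
        expected && NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Saving the streamed output to a file and running vcf_check_columns on it before further analysis gives an early warning that the fetch should be repeated or the file downloaded in full.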